{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Principal Componenet Analysis (PCA)\n",
    "\n",
    "The PCA algorithm is a dimensionality reduction algorithm which works really well for datasets which have correlated columns. It combines the features of X in linear combination such that the new components capture the most information of the data. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well as cuDF DataFrames as the input. \n",
    "\n",
    "For more information about cuDF, refer to the cuDF documentation: https://docs.rapids.ai/api/cudf/stable\n",
    "\n",
    "For more information about cuML's PCA implementation: https://rapidsai.github.io/projects/cuml/en/stable/api.html#cuml.PCA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudf\n",
    "import numpy as np\n",
    "from cuml.datasets import make_blobs\n",
    "from cuml.decomposition import PCA as cuPCA\n",
    "from sklearn.decomposition import PCA as skPCA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 2**15\n",
    "n_features = 400\n",
    "\n",
    "n_components = 2\n",
    "whiten = False\n",
    "svd_solver = \"full\"\n",
    "random_state = 23"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data\n",
    "\n",
    "### GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "device_data, _ = make_blobs(n_samples=n_samples, \n",
    "                            n_features=n_features, \n",
    "                            centers=5, \n",
    "                            random_state=random_state)\n",
    "\n",
    "device_data = cudf.DataFrame.from_gpu_matrix(device_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Host"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copy dataset from GPU memory to host memory.\n",
    "# This is done to later compare CPU and GPU results.\n",
    "host_data = device_data.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn Model\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "pca_sk = skPCA(n_components=n_components,\n",
    "               svd_solver=svd_solver,\n",
    "               whiten=whiten,\n",
    "               random_state=random_state)\n",
    "\n",
    "result_sk = pca_sk.fit_transform(host_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## cuML Model\n",
    "\n",
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "pca_cuml = cuPCA(n_components=n_components,\n",
    "                 svd_solver=svd_solver,\n",
    "                 whiten=whiten,\n",
    "                 random_state=random_state)\n",
    "\n",
    "result_cuml = pca_cuml.fit_transform(device_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate Results\n",
    "\n",
    "### Singular Values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "passed = np.allclose(pca_sk.singular_values_, \n",
    "                     pca_cuml.singular_values_.to_array(), \n",
    "                     atol=0.01)\n",
    "print('compare pca: cuml vs sklearn singular_values_ {}'.format('equal' if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explained Variance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "passed = np.allclose(pca_sk.explained_variance_, \n",
    "                     pca_cuml.explained_variance_.to_array(), \n",
    "                     atol=1e-6)\n",
    "print('compare pca: cuml vs sklearn explained_variance_ {}'.format('equal' if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Explained Variance Ratio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "passed = np.allclose(pca_sk.explained_variance_ratio_, \n",
    "                     pca_cuml.explained_variance_ratio_.to_array(), \n",
    "                     atol=1e-6)\n",
    "print('compare pca: cuml vs sklearn explained_variance_ratio_ {}'.format('equal' if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Components"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "passed = np.allclose(pca_sk.components_, \n",
    "                     np.asarray(pca_cuml.components_.as_gpu_matrix()), \n",
    "                     atol=1e-6)\n",
    "print('compare pca: cuml vs sklearn components_ {}'.format('equal' if passed else 'NOT equal'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "passed = np.allclose(result_sk, np.asarray(result_cuml.as_gpu_matrix()), atol=1e-1)\n",
    "print('compare pca: cuml vs sklearn transformed results %s'%('equal'if passed else 'NOT equal'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
